[Python for data analysis] - Diabetes Dataset


Introduction

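This dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria: the encounter is an inpatient hospital admission; it is a diabetic encounter (diabetes was entered as a diagnosis); the length of stay was between 1 and 14 days; laboratory tests were performed; and medications were administered.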

The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of outpatient, inpatient, and emergency visits in the year before the hospitalization.

The dataset can be found at https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008

Table Of Contents

Data Exploration

Import all the libraries that we will use

Random State

We define a seed to have reproducible results.
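A minimal sketch of what seeding could look like, assuming the notebook relies on Python's `random` module and NumPy for randomness. The name `RANDOM_STATE`, the helper `seed_everything`, and the value 42 are illustrative choices, not taken from the original notebook.

```python
import random

import numpy as np

# Hypothetical seed value; any fixed integer works.
RANDOM_STATE = 42


def seed_everything(seed: int = RANDOM_STATE) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


seed_everything()
first_draw = np.random.rand(3)

seed_everything()
second_draw = np.random.rand(3)

# Re-seeding reproduces exactly the same sequence of draws.
print(np.array_equal(first_draw, second_draw))
```

The same constant can also be passed as the `random_state` argument of scikit-learn splitters and estimators so that splits and model fits are repeatable.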

Display settings

We set the Bokeh output to display inline in the notebook.

We define a persistent style for the plots so that the visualizations look consistent.

All Bokeh and Plotly graphs are interactive: hover over an element to see the data it refers to.
You can also click entries in a plot's legend to show or hide categories.

We set the maximum number of displayed columns to 100 so that the whole dataframe can be printed.
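As a short sketch, the column limit can be raised with the pandas options API. Only the value 100 comes from the text; the variable name `max_cols` is illustrative.

```python
import pandas as pd

# Allow up to 100 columns to be shown before pandas truncates the display.
pd.set_option("display.max_columns", 100)

# Read the option back to confirm that it took effect.
max_cols = pd.get_option("display.max_columns")
print(max_cols)
```

For a temporary change, `pd.option_context("display.max_columns", 100)` can be used in a `with` block instead, which restores the previous value on exit.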

Importing the dataset

We create a dataframe by reading the data from a CSV file.

A first look shows that the dataset is large: 101,766 rows and 50 columns. We will need to analyze and clean the data to reduce it and make it more manageable.
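The loading step and the row/column check might look like the sketch below. The helper `load_diabetes_data`, the file name `diabetic_data.csv`, and the choice to treat `?` as a missing value are assumptions, and the two-row sample is made up so that the example runs without the real file.

```python
import io

import pandas as pd


def load_diabetes_data(source) -> pd.DataFrame:
    """Read the CSV into a dataframe, treating '?' as a missing value (assumption)."""
    return pd.read_csv(source, na_values="?")


# Made-up two-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "encounter_id,patient_nbr,race,gender,time_in_hospital\n"
    "1,100,Caucasian,Female,3\n"
    "2,101,?,Male,5\n"
)
sample_df = load_diabetes_data(sample_csv)

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")

# With the real file (the path is an assumption):
# diabetes_df = load_diabetes_data("diabetic_data.csv")
```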

First overview

Let's use the info() method to inspect the structure of the dataframe and the type of each column.
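For illustration, here is `info()` applied to a tiny made-up dataframe (`toy_df` is hypothetical, not the real dataset). Passing a buffer is optional; it only keeps the printed text available as a string.

```python
import io

import pandas as pd

# Made-up rows, only to show what info() reports.
toy_df = pd.DataFrame(
    {
        "race": ["Caucasian", None, "AfricanAmerican"],
        "time_in_hospital": [3, 5, 2],
    }
)

# info() prints to stdout by default; a buffer captures the text instead.
buffer = io.StringIO()
toy_df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The output lists each column with its non-null count and dtype, which gives a first hint about where values are missing.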

Column meanings

Next, we want to understand what the different columns mean in order to analyze them. The dataset comes with a PDF document explaining some characteristics of the data.

Here, we will use the pandas_profiling library and call its profile_report() method. This library extends the pandas DataFrame for quick analysis.

The report contains several pieces of information: the column types, the unique and missing values, quantile and descriptive statistics, and the correlations between the different features. We will use it as a basis for cleaning the data.
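To make the contents of such a report concrete, the sketch below computes a few of the same summaries with plain pandas. It is a lightweight stand-in, not the pandas_profiling API; the helper `quick_profile` and the toy rows are hypothetical.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, distinct values, and missing values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(),
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(1),
        }
    )


# Made-up rows, only for demonstration.
toy_df = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", None],
        "time_in_hospital": [3, 5, 2, 5],
        "num_medications": [10, 18, 7, 16],
    }
)

profile = quick_profile(toy_df)
print(profile)

# Descriptive statistics and quartiles for the numeric columns.
print(toy_df.describe())

# Pairwise correlations between the numeric columns.
print(toy_df.select_dtypes("number").corr())
```

Columns with a high share of missing values or a single distinct value are natural candidates to review during cleaning.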

Profile report

import pandas as pd
from sklearn.decomposition import PCA

# Fit PCA on the training features and apply the same projection to the test features.
pca = PCA()
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)

# Feature column names, excluding the target column.
colonnes = list(diabetes_df_ml.columns)
colonnes.remove('readmitted')

# Color each training point by its label: red for 1, blue otherwise.
colors = ['red' if x == 1 else 'blue' for x in y_train]
pd.DataFrame(x_train_pca).plot(kind='scatter', x=1, y=0, c=colors, marker=".", s=2, figsize=(20, 20))

# Build a reduced dataframe by dropping a subset of columns.
new_diabetes_df = diabetes_df_ml.drop(columns=[
    "race", "gender", "number_outpatient", "diag_1", "diag_2", "diag_3",
    "number_emergency", "max_glu_serum", "repaglinide", "nateglinide",
    "chlorpropamide", "acetohexamide", "tolbutamide", "acarbose", "miglitol",
    "troglitazone", "tolazamide", "glyburide-metformin", "glipizide-metformin",
])